18:34
2026-06-04
lesswrong.com
artificial-intelligence
Building Better Activation Oracles
Researchers have improved Activation Oracles (AOs)—fine-tuned LLMs that answer natural language questions about a target model's internal activations—by training on on-policy rollouts, using a higher-…